Chinese Word Segmentation at Peking University
Abstract
Word segmentation is the first step in Chinese information processing, so the performance of the segmenter directly and substantially affects every processing step that follows. Different segmenters give different results on issues such as word boundaries. In this paper we argue that there is no need for an absolute definition of word boundary shared by all segmenters, and that differing segmentation results are acceptable as long as they lead to a correct syntactic analysis in the end.
Keywords: automatic Chinese word segmentation, word segmentation evaluation, corpus, natural language processing
Similar resources
A Two-stage Statistical Word Segmentation System for Chinese
In this paper we present a two-stage statistical word segmentation system for Chinese based on word bigram and word-formation models. The system was evaluated on the Peking University corpora at the First International Chinese Word Segmentation Bakeoff. We also report and discuss the results of this evaluation.
Towards a Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid Chinese word segmenter that is being developed as part of a larger Chinese unknown word resolution system. The segmenter consists of two components: a tagging component that uses the transformation-based learning algorithm to tag each character with its position in a word, and a merging component that transforms a tagged character sequence into a word-segmented sen...
A Maximum Entropy Approach to Chinese Word Segmentation
We participated in the Second International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Based on a maximum entropy approach, our word segmenter achieved the highest F measure for AS, CITYU, ...
Chinese Word Segmentation with Maximum Entropy and N-gram Language Model
This paper presents the Chinese word segmentation systems developed by the Speech and Hearing Research Group of the National Laboratory on Machine Perception (NLMP) at Peking University, which were evaluated in the Third International Chinese Word Segmentation Bakeoff held by SIGHAN. The Chinese character-based maximum entropy model, which recasts the word segmentation task as a classification task, i...
An integrated approach for Chinese word segmentation
This paper presents an integrated approach for Chinese word segmentation, which can perform disambiguation and unknown word identification simultaneously on the input. In this work, a hybrid model is used to score known word candidates and unknown word candidates equally by incorporating the modified word-formation models (viz. word-juncture models and word-formation patterns) into word bigram m...
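Several of the entries above treat segmentation as tagging each character with its position in a word and then merging the tagged characters into words. The sketch below illustrates only the merging step. The B/M/E/S tag set (begin, middle, end, single-character word) and the function name `merge_tagged_characters` are assumptions made for illustration; the listed abstracts do not specify a tag set, and the taggers themselves (transformation-based learning, maximum entropy) are not reproduced here.

```python
from typing import List


def merge_tagged_characters(chars: List[str], tags: List[str]) -> List[str]:
    """Merge position-tagged characters into a list of words.

    Assumed tag scheme (not taken from the listed papers):
      B = first character of a multi-character word
      M = middle character
      E = last character of a multi-character word
      S = single-character word
    """
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")

    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        # A B or S tag starts a new word; flush any unfinished word first
        # so that inconsistent tag sequences still yield a segmentation.
        if tag in ("B", "S") and current:
            words.append(current)
            current = ""
        current += ch
        # An E or S tag closes the current word.
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    # Flush a trailing word that was never closed by E or S.
    if current:
        words.append(current)
    return words


if __name__ == "__main__":
    sentence = list("我爱北京大学")
    tags = list("SSBMME")
    print(merge_tagged_characters(sentence, tags))
```

The merge is deliberately tolerant: a sequence such as B, B, E is split at the second B instead of raising an error, which is one possible design choice when the tagger's output is not guaranteed to be well-formed.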
Publication date: 2003